The taxi data used in this exploration was downloaded from http://www.andresmh.com/nyctaxitrips/ and joined with weather data from https://weatherspark.com/. The full data set is very large, so only a small sample is used here. The sample contains 1% of the data for 4 months (January to April 2013).
# show the structure of the taxi data
str(taxi)
## 'data.frame': 60003 obs. of 28 variables:
## $ Medallion : Factor w/ 12934 levels "00005007A9F30E289E760362F69E4EAD",..: 8407 12562 11442 1754 2919 6413 6237 7548 6036 8866 ...
## $ Vendor : Factor w/ 2 levels "CMT","VTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ pickup.datetime : chr "2013-04-15 20:30:00.0" "2013-04-15 23:55:00.0" "2013-04-16 07:59:00.0" "2013-04-04 13:41:00.0" ...
## $ dropoff.datetime : chr "2013-04-15 20:39:00.0" "2013-04-15 23:59:00.0" "2013-04-16 08:14:00.0" "2013-04-04 13:46:00.0" ...
## $ passenger.count : Factor w/ 7 levels "0","1","2","3",..: 2 3 4 2 2 2 2 2 6 2 ...
## $ trip.time : int 540 240 900 300 1380 480 600 480 600 300 ...
## $ trip.distance : num 1.63 1.33 2.86 0.57 9.58 0.77 2.63 2.69 2.35 1.28 ...
## $ pickup.longitude : num -74 -73.9 -74 -74 -74 ...
## $ pickup.latitude : num 40.8 40.8 40.8 40.8 40.8 ...
## $ dropoff.longitude: num -74 -74 -74 -74 -73.9 ...
## $ dropoff.latitude : num 40.8 40.8 40.8 40.7 40.8 ...
## $ payment.type : Factor w/ 5 levels "CRD","CSH","DIS",..: 2 2 1 2 1 2 1 1 2 2 ...
## $ fare.amount : num 8.5 6 12.5 5 29 7 10.5 9.5 10 6.5 ...
## $ Surcharge : num 0.5 0.5 0 0 0 0 0 0.5 0 0 ...
## $ mta.tax : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ tip.amount : num 0 0 1 0 5 0 3 2 0 0 ...
## $ tolls.amount : num 0 0 0 0 5.33 0 0 0 0 0 ...
## $ total.amount : num 9.5 7 14.5 5.5 40.6 ...
## $ Year : Factor w/ 1 level "2013": 1 1 1 1 1 1 1 1 1 1 ...
## $ Month : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ Day : Factor w/ 31 levels "1","10","11",..: 7 7 8 26 26 22 16 15 22 16 ...
## $ Hour : Factor w/ 24 levels "0","1","2","3",..: 21 24 8 14 15 8 8 24 8 7 ...
## $ Temperature : num 9.4 8.9 9.4 8.3 9.4 11.7 5 7.8 11.7 5 ...
## $ Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pickup.date : Date, format: "2013-04-15" "2013-04-15" ...
## $ weekday : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 2 4 4 1 2 1 1 2 ...
## $ weekend : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ rain : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# show the factors
levels(taxi$Vendor)
## [1] "CMT" "VTS"
levels(taxi$passenger.count)
## [1] "0" "1" "2" "3" "4" "5" "6"
levels(taxi$Hour)
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
## [15] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23"
levels(taxi$weekday)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
## [7] "Sunday"
levels(taxi$weekend)
## [1] "no" "yes"
levels(taxi$rain)
## [1] "no" "yes"
# remove NAs from the data set
taxi = na.omit(taxi)
# a summary of the data
summary(taxi)
## Medallion Vendor pickup.datetime
## 6DFD37A4BDC448C365B36465D73A4CCE: 18 CMT:30506 Length:59894
## 5E9BC99D16CFF51F5BB3361660713D1D: 16 VTS:29388 Class :character
## 630D390996EED8370C1E5494B9EFED1F: 16 Mode :character
## 6F9EC82D4E5C8B03A93FC1C5DAB16465: 16
## 7A244CCB309CDA8071892F11902ADC5C: 16
## 0FF5BCE95C86107079CED655BD5D5BCA: 15
## (Other) :59797
## dropoff.datetime passenger.count trip.time trip.distance
## Length:59894 0: 0 Min. : 0.0 Min. : 0.000
## Class :character 1:42365 1st Qu.: 360.0 1st Qu.: 1.010
## Mode :character 2: 7998 Median : 600.0 Median : 1.720
## 3: 2451 Mean : 713.8 Mean : 2.804
## 4: 1199 3rd Qu.: 910.0 3rd Qu.: 3.100
## 5: 3578 Max. :8461.0 Max. :69.560
## 6: 2303
## pickup.longitude pickup.latitude dropoff.longitude dropoff.latitude
## Min. :-74.54 Min. : 0.00 Min. :-74.83 Min. : 0.00
## 1st Qu.:-73.99 1st Qu.:40.74 1st Qu.:-73.99 1st Qu.:40.73
## Median :-73.98 Median :40.75 Median :-73.98 Median :40.75
## Mean :-72.63 Mean :40.01 Mean :-72.57 Mean :39.99
## 3rd Qu.:-73.97 3rd Qu.:40.77 3rd Qu.:-73.96 3rd Qu.:40.77
## Max. : 0.00 Max. :45.60 Max. : 0.00 Max. :73.98
##
## payment.type fare.amount Surcharge mta.tax
## CRD:32063 Min. : 2.50 Min. :0.0000 Min. :0.0000
## CSH:27629 1st Qu.: 6.50 1st Qu.:0.0000 1st Qu.:0.5000
## DIS: 46 Median : 9.00 Median :0.0000 Median :0.5000
## NOC: 117 Mean : 11.97 Mean :0.3201 Mean :0.4983
## UNK: 39 3rd Qu.: 13.50 3rd Qu.:0.5000 3rd Qu.:0.5000
## Max. :356.00 Max. :1.5000 Max. :0.5000
##
## tip.amount tolls.amount total.amount Year
## Min. : 0.000 Min. : 0.0000 Min. : 3.00 2013:59894
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 8.00
## Median : 1.000 Median : 0.0000 Median : 10.80
## Mean : 1.157 Mean : 0.2238 Mean : 14.33
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 16.00
## Max. :110.000 Max. :18.5000 Max. :356.00
##
## Month Day Hour Temperature
## 1:14826 23 : 2232 19 : 3802 Min. :-11.700
## 2:13941 16 : 2146 18 : 3670 1st Qu.: 0.600
## 3:15794 22 : 2114 20 : 3539 Median : 4.400
## 4:15333 15 : 2109 21 : 3524 Mean : 4.951
## 2 : 2082 22 : 3355 3rd Qu.: 8.900
## 19 : 2051 14 : 3112 Max. : 27.800
## (Other):47160 (Other):38892
## Precipitation pickup.date weekday weekend
## Min. :0.00000 Min. :2013-01-01 Monday :7554 no :42802
## 1st Qu.:0.00000 1st Qu.:2013-02-01 Tuesday :8724 yes:17092
## Median :0.00000 Median :2013-03-03 Wednesday:8410
## Mean :0.09256 Mean :2013-03-02 Thursday :8803
## 3rd Qu.:0.00000 3rd Qu.:2013-04-01 Friday :9311
## Max. :8.89000 Max. :2013-04-30 Saturday :9121
## Sunday :7971
## rain
## no :55143
## yes: 4751
##
##
##
##
##
Most taxi trips are short: the median trip distance is 1.72 miles and the mean is below 3 miles. The mean passenger count is 1.7. The median tip is just $1, but the maximum is $110. The mean fare is $11.97 and the mean total amount is $14.33; the maximum for both is $356.
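Because `passenger.count` is stored as a factor, its mean cannot be taken directly. The following is a minimal sketch of how the value of 1.7 can be reproduced from the counts in the summary above; the vector `counts` is typed in by hand from that output and is not part of the original analysis.

```r
# Passenger counts per level, copied from the summary() output above
counts <- c("0" = 0, "1" = 42365, "2" = 7998, "3" = 2451,
            "4" = 1199, "5" = 3578, "6" = 2303)

# Rebuild a factor shaped like taxi$passenger.count from those counts
passenger.count <- factor(rep(names(counts), times = counts),
                          levels = names(counts))

# as.numeric() on a factor returns the internal level codes (1..7),
# which here are all one higher than the real passenger numbers
wrong.mean <- mean(as.numeric(passenger.count))

# Converting through character gives the real passenger numbers
mean.passengers <- mean(as.numeric(as.character(passenger.count)))
round(mean.passengers, 1)
```

The same conversion through `as.character()` would be needed for `Hour`, `Day` and `Month`, which are also factors in this data set.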
Looking at the taxi rides per day shows a distribution between 400 and 600 rides. Monday has the lowest number of rides, with a median of 450. The highest medians are on Friday and Saturday, with a higher variance on Saturday. There are not many taxi rides when it is raining.
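The code behind the rides-per-day comparison is not shown in this report. As a rough sketch under stated assumptions, the counting step could look like the following; `pickup.date` here is a randomly generated stand-in for `taxi$pickup.date`, so the resulting numbers are illustrative only.

```r
# Hypothetical stand-in for taxi$pickup.date: one entry per trip
set.seed(1)
days <- seq(as.Date("2013-01-01"), as.Date("2013-04-30"), by = "day")
pickup.date <- sample(days, size = 60000, replace = TRUE)

# Number of rides on each calendar day
rides.per.day <- as.data.frame(table(pickup.date), stringsAsFactors = FALSE)
names(rides.per.day) <- c("date", "rides")
rides.per.day$date <- as.Date(rides.per.day$date)

# "%u" gives the ISO weekday number (1 = Monday), which avoids
# locale-dependent day names
rides.per.day$weekday <- factor(
  format(rides.per.day$date, "%u"),
  levels = as.character(1:7),
  labels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)

# Median number of rides per day for each weekday
median.rides <- tapply(rides.per.day$rides, rides.per.day$weekday, median)
median.rides
```

Aggregating to one row per day first matters here: taking the median over trips instead of over days would answer a different question.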
The distribution of fares over days shows that most trips cost less than $25 (the mean total amount is $14.33). However, a line just above $50 stands out and should be investigated in more detail.
Looking at trip fares above $50, we can see that a fare of exactly $52 is by far the most frequent (984 trips).
# subset of taxi trips with fare > $50
taxi50 = taxi[taxi$fare.amount > 50,]
table(taxi50$fare.amount)
##
## 50.5 51 51.5 52 52.5 53 53.5 54 54.5 55
## 6 8 7 984 8 7 6 8 5 8
## 55.5 56 56.5 57 57.3 57.5 58 58.5 59 59.5
## 2 7 7 1 1 6 9 3 4 3
## 60 60.5 61 61.5 62 62.5 63 63.5 64 64.5
## 18 7 3 4 5 4 3 2 9 5
## 65 65.5 66 66.5 67 67.5 68 68.5 69 69.5
## 7 2 4 1 4 4 5 4 4 4
## 70 70.5 71 72 72.5 73 74 74.5 75 76
## 8 2 2 4 2 2 3 1 2 1
## 76.5 77 77.5 79 80 80.5 81 84 84.5 85
## 2 1 1 1 7 1 3 1 1 3
## 85.5 86.01 88 90 93.5 97 98 100 102 102.5
## 1 1 1 1 1 1 1 5 2 1
## 106.5 110 112 115 118.5 120 122.22 123 125 130
## 1 1 1 1 2 3 1 1 1 2
## 140 163 178 200 202 204 250 268 356
## 1 1 1 1 1 1 1 1 1
The high frequency of this particular fare suggests a fixed-price offer for trips to or from the airport. To verify this, the geo coordinates of the start and end points of these trips are checked.
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.7,-73.9&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN&sensor=false
The maps show that the special-price trips either start at the airport and end in Manhattan, or start in Manhattan and end at the airport.
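The report does not show how the trips to the airport were separated from the trips from the airport. One possible approach, sketched here under assumptions, is a simple bounding box around JFK airport. The coordinates are approximate, the box size is arbitrary, and the helper `near.jfk` and the `trips` data frame are hypothetical names used only for illustration.

```r
# Approximate location of JFK airport (assumption, rounded)
jfk.lat <- 40.64
jfk.lon <- -73.78
box <- 0.03  # half-width of the box in degrees, chosen arbitrarily

# Hypothetical helper: TRUE where a coordinate pair falls inside the box
near.jfk <- function(lat, lon) {
  abs(lat - jfk.lat) <= box & abs(lon - jfk.lon) <= box
}

# Toy trips using the column names of the taxi data
trips <- data.frame(
  pickup.latitude   = c(40.76, 40.645, 40.75),
  pickup.longitude  = c(-73.98, -73.78, -73.99),
  dropoff.latitude  = c(40.64, 40.76, 40.72),
  dropoff.longitude = c(-73.79, -73.97, -74.00)
)

# Flag the direction of each trip relative to the airport
trips$to.airport   <- near.jfk(trips$dropoff.latitude, trips$dropoff.longitude)
trips$from.airport <- near.jfk(trips$pickup.latitude, trips$pickup.longitude)
trips
```

A box in degrees is crude but cheap; a distance-based test would be more precise if the boundary cases matter.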
Arriving at the airport on time is important. Therefore we look at how long the ride from Manhattan to the airport takes.
The trip time to the airport depends on the hour of the day. Around 4 pm it takes much longer than around 5 am. The trip times on weekends are mostly below the fitted line, showing that it is easier to drive to the airport on weekends.
Because it is important to know how long the drive to the airport will take, a prediction model is created. The model does not give a good prediction (R-squared is below 0.35). The reasons are the small data set and the large variance of the trip times at some hours of the day. The graph below shows the distribution of trip times to the airport, the predicted trip times (in blue), and the upper limit of the 99% confidence interval (in red).
lm1 = lm(trip.time ~ Hour, data=taxi50.to.ap)
summary(lm1)
##
## Call:
## lm(formula = trip.time ~ Hour, data = taxi50.to.ap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1525.2 -404.7 -81.1 243.5 5928.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2249.50 375.63 5.989 5.51e-09 ***
## Hour1 -579.50 650.60 -0.891 0.3737
## Hour3 -697.75 531.22 -1.313 0.1899
## Hour4 -621.57 422.75 -1.470 0.1424
## Hour5 -670.54 406.98 -1.648 0.1004
## Hour6 -513.42 403.49 -1.272 0.2041
## Hour7 -407.68 438.64 -0.929 0.3533
## Hour8 57.83 451.45 0.128 0.8981
## Hour9 27.21 470.87 0.058 0.9539
## Hour10 -224.60 444.45 -0.505 0.6137
## Hour11 -274.07 425.92 -0.643 0.5204
## Hour12 -21.75 405.72 -0.054 0.9573
## Hour13 107.58 405.72 0.265 0.7910
## Hour14 539.17 399.88 1.348 0.1785
## Hour15 895.71 397.74 2.252 0.0250 *
## Hour16 722.28 402.49 1.795 0.0736 .
## Hour17 739.74 404.56 1.828 0.0684 .
## Hour18 533.02 406.98 1.310 0.1912
## Hour19 282.96 429.55 0.659 0.5105
## Hour20 -338.42 433.74 -0.780 0.4358
## Hour21 4.25 531.22 0.008 0.9936
## Hour22 -610.50 503.96 -1.211 0.2266
## Hour23 -229.17 451.45 -0.508 0.6121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 751.3 on 331 degrees of freedom
## Multiple R-squared: 0.3461, Adjusted R-squared: 0.3027
## F-statistic: 7.965 on 22 and 331 DF, p-value: < 2.2e-16
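Since `Hour` is a factor, each coefficient in the summary above is an offset in seconds relative to the baseline level (hour 0, the intercept). As a small worked example, the fitted trip times for 5 am and 3 pm can be recomputed by hand from the printed estimates; the values below are copied from that output.

```r
# Intercept and two hour offsets, copied from summary(lm1) above (seconds)
intercept <- 2249.50
offsets <- c("5" = -670.54, "15" = 895.71)

# With a factor predictor, the fitted value is baseline plus offset
fit.seconds <- intercept + offsets

# Convert to minutes for easier reading
fit.minutes <- fit.seconds / 60
round(fit.minutes, 1)
```

This gives roughly 26 minutes at 5 am and 52 minutes at 3 pm, so the afternoon trip takes about twice as long.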
# build a test data set to predict (5am to 8pm on workdays)
# note: lm1 uses only Hour, so the weekend column does not affect the prediction
test=data.frame(Hour=factor(c(5:20),levels=c(5:20)), weekend="no")
# do prediction on test data
pred = predict(lm1, test, interval = c("confidence"), level = 0.99)
pred = data.frame(pred)
test$fit = pred$fit
test$upr = pred$upr
# show predicted data and upper limit for CI
test
## Hour weekend fit upr
## 1 5 no 1578.957 1984.792
## 2 6 no 1736.077 2117.781
## 3 7 no 1841.818 2428.655
## 4 8 no 2307.333 2956.106
## 5 9 no 2276.714 3012.353
## 6 10 no 2024.900 2640.380
## 7 11 no 1975.429 2495.604
## 8 12 no 2227.750 2625.041
## 9 13 no 2357.083 2754.374
## 10 14 no 2788.667 3144.014
## 11 15 no 3145.212 3484.023
## 12 16 no 2971.778 3346.347
## 13 17 no 2989.240 3378.504
## 14 18 no 2782.522 3188.357
## 15 19 no 2532.462 3072.273
## 16 20 no 1911.083 2472.937
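The `upr` column above is the upper limit of a confidence interval, which describes the uncertainty of the mean trip time for that hour. For planning a single trip, a prediction interval may be the more relevant quantity. The sketch below contrasts the two on synthetic data; `toy` and `toy.lm` are made-up stand-ins for `taxi50.to.ap` and `lm1`, so the numbers do not correspond to the real trips.

```r
# Synthetic stand-in for taxi50.to.ap: trip times (seconds) for three hours
set.seed(42)
toy <- data.frame(
  Hour = factor(rep(c("5", "12", "15"), each = 30)),
  trip.time = c(rnorm(30, 1600, 400),
                rnorm(30, 2200, 600),
                rnorm(30, 3100, 800))
)
toy.lm <- lm(trip.time ~ Hour, data = toy)

# New observation at 3 pm, using the same factor levels as the model
new.data <- data.frame(Hour = factor("15", levels = levels(toy$Hour)))

# Confidence interval: uncertainty of the mean trip time at that hour
ci <- predict(toy.lm, new.data, interval = "confidence", level = 0.99)

# Prediction interval: range expected to cover a single new trip
pr <- predict(toy.lm, new.data, interval = "prediction", level = 0.99)

rbind(confidence = ci[1, ], prediction = pr[1, ])
```

Both intervals share the same fitted value, but the prediction interval is always wider because it also includes the scatter of individual trips around the mean.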
The taxi fare per mile and the speed are distributed as expected. The mean fare per mile is $5.87 and the mean speed is 13.3 miles per hour. The histogram of the fare per mile has a long tail. Cutting off the upper 0.01% of the data gives a better overview of the distribution.
# Subselecting taxi trips with distance and trip time > 0
taxi0 = taxi[taxi$trip.distance>0 & taxi$trip.time>0,]
# calculate new variables for fare.per.mile and speed (mph; trip.time is in seconds)
taxi0$fare.per.mile = taxi0$fare.amount/taxi0$trip.distance
taxi0$speed = taxi0$trip.distance/taxi0$trip.time*3600
# omit trips with speed too high
taxi0 = taxi0[taxi0$speed < 100,]
# distribution of new variables
summary(taxi0$fare.per.mile)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1008 4.0450 5.0000 5.8690 6.3640 700.0000
summary(taxi0$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.060 8.829 11.830 13.270 15.900 96.920
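The trimming of the upper 0.01% mentioned above is not shown in the code. A minimal sketch of one way to do it with `quantile()` follows; the vector `fare.per.mile` is synthetic long-tailed data standing in for `taxi0$fare.per.mile`.

```r
# Synthetic long-tailed stand-in for taxi0$fare.per.mile
set.seed(7)
fare.per.mile <- c(rlnorm(9999, meanlog = log(5), sdlog = 0.3), 700)

# 99.99th percentile: the cut-off for the upper 0.01% of the data
cutoff <- quantile(fare.per.mile, probs = 0.9999)

# Keep only values at or below the cut-off
trimmed <- fare.per.mile[fare.per.mile <= cutoff]

# Compare the maximum before and after trimming
c(before = max(fare.per.mile), after = max(trimmed))
```

Trimming by quantile only affects what the histogram displays; the summary statistics above are computed on the untrimmed data.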
Comparing the fare per mile for workdays and weekends shows that the fare is slightly higher on workdays. This might be due to heavier traffic compared to the weekend.
Looking at the speed per weekday, you can see that the median speed is highest on Sunday and lowest on Friday.
Comparing the speed over the hour of day shows a smaller number of taxi rides in the early morning hours and the slowest speeds in the early afternoon. The smoothed line for the mean speed visualizes the dependency of speed on the hour of day.
Separating the speed by weekend and workdays in a boxplot shows that night trips on weekends (Saturday or Sunday morning) are slower than on workdays. During the daytime, speeds are higher on weekends than on workdays.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Before comparing taxi rides with the weather data, three plots visualize the weather data. The temperature is shown as points and tiles, whereas the rain is shown as tiles only. There are some “holes” where no weather data is available, because the sample contains no taxi trips at that point in time.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Comparing the speed of the taxis by rain visually gives no information. Looking at the mean and median values, a small difference is visible. Taxis are slower (mean speed 12.4 vs. 13.4 mph) and the fares are higher when it is raining. Also, the trip distance is slightly shorter when it is raining (mean 2.74 vs. 2.83 miles), suggesting that a taxi might be used even for shorter distances.
by(taxi0[,c("rain","speed","trip.distance","tip.amount")], taxi0$rain, summary)
## taxi0$rain: no
## rain speed trip.distance tip.amount
## no :54692 Min. : 0.06 Min. : 0.01 Min. : 0.00
## yes: 0 1st Qu.: 8.88 1st Qu.: 1.05 1st Qu.: 0.00
## Median :11.91 Median : 1.76 Median : 1.00
## Mean :13.35 Mean : 2.83 Mean : 1.14
## 3rd Qu.:16.00 3rd Qu.: 3.14 3rd Qu.: 2.00
## Max. :96.92 Max. :69.56 Max. :110.00
## --------------------------------------------------------
## taxi0$rain: yes
## rain speed trip.distance tip.amount
## no : 0 Min. : 0.600 Min. : 0.030 Min. : 0.000
## yes:4715 1st Qu.: 8.258 1st Qu.: 1.000 1st Qu.: 0.000
## Median :10.935 Median : 1.700 Median : 1.000
## Mean :12.358 Mean : 2.735 Mean : 1.159
## 3rd Qu.:14.667 3rd Qu.: 3.000 3rd Qu.: 2.000
## Max. :71.111 Max. :24.700 Max. :20.000
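The `by()` output above is detailed but hard to compare across groups. If only the group means are of interest, a more compact table could be produced with `aggregate()`, as in the sketch below. The `toy` data frame is a made-up example with the same column names; it is not taken from the taxi data.

```r
# Toy data with the same column names as taxi0
toy <- data.frame(
  rain = factor(c("no", "no", "no", "yes", "yes"), levels = c("no", "yes")),
  speed = c(12, 14, 16, 10, 12),
  trip.distance = c(2.0, 3.0, 4.0, 1.5, 2.5)
)

# One row per rain level, one column per averaged variable
group.means <- aggregate(cbind(speed, trip.distance) ~ rain,
                         data = toy, FUN = mean)
group.means
```

Replacing `FUN = mean` with `FUN = median` gives the corresponding table of medians.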
The distribution of tips over trip distance shows that many people don’t tip at all. For longer distances the tips rise slightly, but on average they stay below $1 per mile.
Even more interesting is the fact that the mean tip is lower for 3 or 4 passengers in a taxi than it is for 1, 2, 5, or 6 passengers. I have no idea what might cause this.
## taxi0$passenger.count: 0
## NULL
## --------------------------------------------------------
## taxi0$passenger.count: 1
## tip.amount passenger.count
## Min. : 0.000 0: 0
## 1st Qu.: 0.000 1:41971
## Median : 1.000 2: 0
## Mean : 1.162 3: 0
## 3rd Qu.: 2.000 4: 0
## Max. :53.000 5: 0
## 6: 0
## --------------------------------------------------------
## taxi0$passenger.count: 2
## tip.amount passenger.count
## Min. : 0.000 0: 0
## 1st Qu.: 0.000 1: 0
## Median : 0.000 2:7952
## Mean : 1.107 3: 0
## 3rd Qu.: 2.000 4: 0
## Max. :40.000 5: 0
## 6: 0
## --------------------------------------------------------
## taxi0$passenger.count: 3
## tip.amount passenger.count
## Min. : 0.000 0: 0
## 1st Qu.: 0.000 1: 0
## Median : 0.000 2: 0
## Mean : 1.008 3:2440
## 3rd Qu.: 1.000 4: 0
## Max. :18.000 5: 0
## 6: 0
## --------------------------------------------------------
## taxi0$passenger.count: 4
## tip.amount passenger.count
## Min. : 0.0000 0: 0
## 1st Qu.: 0.0000 1: 0
## Median : 0.0000 2: 0
## Mean : 0.9723 3: 0
## 3rd Qu.: 1.0000 4:1191
## Max. :16.0000 5: 0
## 6: 0
## --------------------------------------------------------
## taxi0$passenger.count: 5
## tip.amount passenger.count
## Min. : 0.000 0: 0
## 1st Qu.: 0.000 1: 0
## Median : 1.000 2: 0
## Mean : 1.134 3: 0
## 3rd Qu.: 2.000 4: 0
## Max. :15.000 5:3560
## 6: 0
## --------------------------------------------------------
## taxi0$passenger.count: 6
## tip.amount passenger.count
## Min. : 0.000 0: 0
## 1st Qu.: 0.000 1: 0
## Median : 0.000 2: 0
## Mean : 1.115 3: 0
## 3rd Qu.: 1.000 4: 0
## Max. :110.000 5: 0
## 6:2293
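To make the comparison across passenger counts easier to read, the mean tips from the output above can be collected into a single vector and sorted. The values below are copied by hand from that output; `mean.tip` is a name introduced only for this illustration.

```r
# Mean tip per passenger count, copied from the by() output above (dollars)
mean.tip <- c("1" = 1.162, "2" = 1.107, "3" = 1.008,
              "4" = 0.9723, "5" = 1.134, "6" = 1.115)

# Sort ascending to see which groups tip least on average
sorted.tip <- sort(mean.tip)
sorted.tip

# Passenger count with the lowest mean tip
lowest <- names(which.min(mean.tip))
lowest
```

On the real data, a call such as `tapply(taxi0$tip.amount, taxi0$passenger.count, mean)` should yield the same means directly, although that call is not part of the original report.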
The distribution of fares over days shows that most trips cost less than $25 (the mean total amount is $14.33). However, a line just above $50 stands out. A special fixed-price offer for trips between the city and the airport might be the reason.
The trip time to the airport depends on the hour of the day. Around 4 pm it takes much longer than around 5 am. The trip times on weekends are mostly below the fitted line, showing that it is easier to drive to the airport on weekends.
The distribution of tips over trip distance shows that many people don’t tip at all. For longer distances the tips rise slightly, but on average they stay below $1 per mile.
The NYC Taxi data set contains a large amount of data: more than 10 million taxi trips per month. In this investigation, just a small subset of the trip data is analyzed (60,000 trips over 4 months).
I started by looking at the distribution of taxi trips over dates and weekdays, but this does not give much information, as the distribution is almost even. Looking at the taxi fares revealed a special price, and it turned out that trips between Manhattan and JFK airport are the dominant trips behind this special rate. Looking into trip times from Manhattan to JFK airport shows a strong dependency on the hour of day. In the afternoon it takes about twice as long to reach the airport as in the early morning. I created a prediction model for this, but based on the sample the quality of this model was not very good. It would be interesting to create and test the model with the full data set. Another interesting insight was that the mean tip amount is just around $1. As a German, I had expected much higher tips. It is also interesting that the mean tip is lowest with 4 passengers in a taxi.
There are numerous relationships that are not investigated here. For example, the trip distance over time or the number of passengers by night or day could also be interesting. Looking at the full data set would give additional insights that are not possible from the sample. For example, one could analyze how many taxis are on the street in a given time frame, how many passengers each taxi has, or how long taxis have to wait between customers.